Helga Sigríður Magnúsdóttir s202027
Hlynur Árni Sigurjónsson s192302
Katrín Erla Bergsveinsdóttir s202026
Kristín Björk Lilliendahl s192296


Business Question - Text Analysis

What Do Customers Like About Pizza Restaurants in Copenhagen, and Is Location Important?

We wanted to further analyze pizza restaurants in Copenhagen and what their reviews can tell us 🇩🇰🍕

In this business question, we are consultants for a company that wants to open a pizza restaurant in Copenhagen. The company wants the restaurant to be popular and to have great reviews. Thus, we took data on reviews and restaurants from TripAdvisor to find out what people like about pizza restaurants in Copenhagen. Moreover, because the company is not sure where to locate its new restaurant, we checked whether the location of a pizza restaurant is correlated with great reviews. The company also wanted to know which features customers like most about restaurants that offer pizza and what they complain about.

In detail, the company asked us to look at the positive reviews to find out what words are associated with reviews that have a rating of four and five stars. Furthermore, the company also wanted to know what to be aware of. Thus, we also analyzed words associated with okay and bad reviews.

Secondly, we analyzed if there was a connection between great/bad ratings and the location of the restaurant.

Lastly, we performed a sentiment analysis on the reviews.

So what we want to analyze 🕵🏼‍♀️

Data that will be used 🥒

Contents


Let's start by importing 🐼


1. Descriptive Data Analysis  👩🏽‍💻

We already cleaned the data, added the latitude and longitude of each restaurant, and saved the result in the restaurants.pkl file. Therefore, in this chapter we only need to import the pickled file, filter the restaurants that list pizza as a CousineType, and merge the result with the reviews pickle file 🥒

1.1 Restaurants 🏠

Read in for Jupyter and Google Colab 👓

🏘 Let's take a look at the dataframe

First, we wanted only the restaurants that list pizza as one of their CousineType values. Because CousineType is a list, we needed to split it up using a pandas Series 🐼

We used this code to assist in this step:

Start by checking how many times pizza appears as a CousineType 🍕

👆🏼 We can see that 169 restaurants in the Copenhagen area offer pizza as a CousineType.


Let's check where pizza ranks among the other CousineTypes by plotting the counts:

The same figure can be found in Milestone1.ipynb, where we can see that European is the most common cuisine type 🇪🇺

However, since we are analyzing pizza restaurants, we need to filter out all the restaurants that don't offer pizza, or at least those that don't mention it as a CousineType 🇩🇰🍕


To do that, we need to split the CousineType column up and make storeName our index.

✨ Looks good, we see that the column CousineType has been split into 4 columns [0, 1, 2, 3].

Then we search those columns to see what restaurants offer pizza 🍕🤩

We can see again that 169 restaurants have pizza as a CousineType, the same number as we got in the figure above, so we are on the right path 🕺🏽
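As a minimal sketch of the split-and-search steps above, assuming a toy dataframe in place of restaurants.pkl (the restaurant names and cuisine lists below are made up for illustration), the list column can be expanded into numbered columns and searched for pizza roughly like this 🍕

```python
import pandas as pd

# Toy stand-in for restaurants.pkl (hypothetical rows, not the real data)
restaurants = pd.DataFrame({
    "storeName": ["Alpha Pizza", "Beta Sushi", "Gamma Trattoria"],
    "CousineType": [
        ["Italian", "Pizza"],
        ["Japanese", "Sushi"],
        ["Pizza", "European", "Italian"],
    ],
})

# Expand each list into numbered columns (0, 1, 2, ...) with storeName as index;
# shorter lists are padded with missing values
cuisine_cols = pd.DataFrame(
    restaurants["CousineType"].tolist(),
    index=restaurants["storeName"],
)

# Keep the rows where any of the expanded columns equals "Pizza"
is_pizza = cuisine_cols.eq("Pizza").any(axis=1)
pizza = cuisine_cols[is_pizza]

print(len(pizza), "restaurants list pizza as a CousineType")
```

Counting the rows of `pizza` gives the same kind of sanity check as comparing against the count from the figure.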

The next step is to get the reviews for those restaurants. To do so, we need to get the review database and merge it with the pizza filtered restaurants.


1.2 Reviews 👩🏽‍🍳

Read in for Jupyter and Google Colab 👓

We start by loading in the pickled file for the reviews which was cleaned in Milestone1.ipynb and take a look at it 🥒

First, we need to remove the reviews for restaurants that do not have pizza as a CousineType. This is done by merging the two dataframes on the column storeName.

...and check the columns in the new merged dataframe, pizza_reviews 👀

💥 Awesome, we have over 3500 reviews!

Now let's see how many restaurants we have associated with these reviews

We can see 👆🏼 that the restaurant Neighbourhood has by far the most reviews. However, this list is short compared to the full list of restaurants that have pizza as a CousineType.

One reason we observed is that some restaurants have no reviews. For example, Pizza 13 is in the dataframe pizza but not in the dataframe pizza_reviews, because there are no reviews for Pizza 13. A Google search showed that this restaurant is permanently closed 🏚🍕

Nevertheless, for the restaurants we have in our dataframe pizza_reviews, we have enough reviews to carry out the text analysis for our research 🕵🏼‍♀️
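The merge and the "missing restaurant" effect described above can be sketched as follows. This is only an illustration on made-up rows: the ratings and the restaurant "Sushi Place" are hypothetical, and the use of an inner merge is our assumption about how the dataframes were combined.

```python
import pandas as pd

# Hypothetical pizza-filtered restaurants; Pizza 13 has no reviews
pizza = pd.DataFrame({"storeName": ["Neighbourhood", "Pizza 13"]})

# Hypothetical reviews, including one for a non-pizza restaurant
reviews = pd.DataFrame({
    "storeName": ["Neighbourhood", "Neighbourhood", "Sushi Place"],
    "rating": [5, 4, 3],
})

# An inner merge keeps only reviews whose storeName is in the pizza list
pizza_reviews = reviews.merge(pizza, on="storeName", how="inner")

# A left merge with indicator=True reveals pizza restaurants without reviews
check = pizza.merge(reviews, on="storeName", how="left", indicator=True)
no_reviews = check.loc[check["_merge"] == "left_only", "storeName"].tolist()

print(len(pizza_reviews), "reviews kept;", "no reviews for:", no_reviews)
```

The indicator column is a quick way to see why the merged list of restaurants is shorter than the filtered list.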


1.3. Explore Merged Dataframe 🔍

For our initial research, exploratory data analysis can be done effectively using visualization 👀

Let's start by checking how the customers' reviews are distributed between rating 1 to 5 ⭐️

As can be seen in the plot above 👆🏼, over 50% of the reviews have a rating of 5. Moreover, reviews with a rating of 4 make up around 30%.

Next, it is useful to see how the average rating of restaurants with pizza as CousineType in Copenhagen has developed over the years. Are they getting better or worse? 🤷🏽‍♀️

👆🏼 Because the plot above shows the average rating, no yearly value falls below 3: ratings of 1 to 3 make up under 20% of the reviews in total. However, the plot shows that customers have been giving restaurants with pizza as CousineType higher ratings in recent years.
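To illustrate why a yearly average hides the low ratings, here is a small sketch on made-up numbers. The column name `year` is an assumption; the real dataframe may store the review date differently.

```python
import pandas as pd

# Hypothetical ratings for two years (not the real data)
df = pd.DataFrame({
    "rating": [5, 5, 5, 4, 4, 3, 1, 5, 4, 5],
    "year": [2018] * 5 + [2019] * 5,
})

# Share of each rating, as plotted in the distribution chart
share = df["rating"].value_counts(normalize=True).sort_index()

# Average rating per year: a single 1-star review barely moves the mean
yearly_mean = df.groupby("year")["rating"].mean()

print(share)
print(yearly_mean)
```

Even though 2019 contains a 1-star review in this toy data, its mean stays well above 3, which is the same effect we see in the plot.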


2. Text Preprocessing 💬

The same steps were taken as in Lecture 2: Text Analytics, Text Preprocessing, to do the data cleaning on our reviews:

1. Removing punctuation
2. Converting text to lowercase
3. Removing stopwords
4. Stemming or lemmatizing


In order to clean the text, we first need to do some imports:

Reference: Code from Lecture 2

Now we have the function we need to clean our reviews, so we apply it to our dataframe pizza_reviews. The cleaned reviews are stored in a column named reviews_cleaned 🧹

💥Great!

Now the pizza_reviews dataframe has the new column reviews_cleaned. We can now go into analyzing the reviews!
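For readers without the lecture code at hand, the four cleaning steps can be sketched with the standard library alone. This is not the function from Lecture 2: the tiny stopword set and the `crude_stem` helper are hypothetical stand-ins for a full stopword list and a real stemmer or lemmatizer.

```python
import string

# Tiny illustrative stopword set (an assumption, not a full stopword list)
STOPWORDS = {"the", "a", "an", "and", "was", "is", "were", "it", "to", "of", "we", "in"}


def crude_stem(word):
    """Very rough stand-in for a stemmer/lemmatizer: strip a plural 's'."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def clean_review(text):
    # 1. Remove punctuation
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    # 2. Convert to lowercase and split into tokens
    tokens = no_punct.lower().split()
    # 3. Remove stopwords, 4. stem what is left
    return " ".join(crude_stem(t) for t in tokens if t not in STOPWORDS)


print(clean_review("The pizzas were GREAT, and the staff was friendly!"))
# pizza great staff friendly
```

On a dataframe, a function like this would typically be applied to the raw review column with `.apply(clean_review)` to produce reviews_cleaned.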


3. Analyzing the Reviews by Ratings ⭐️

In this chapter, we will split the reviews into three groups:

Each section will include:

3.1 All Ratings 🙌🏼

First, let's analyze the reviews as a whole, regardless of their rating.

WordCloud ☁️

Reference: Code from Lecture 2

The picture above shows that pizza is the most frequent word; place, restaurant, food and good also look very common. However, a visualization alone makes it hard to judge accurately what a good review contains.

So let's check their frequency as well.

Frequency 📊

Let's take a look at what words are the most common in the reviews.

References: Code based on code from this site

The function takes the review dataframe you want to use, the n-gram size, and the number of words to display. The list is sorted so that the highest frequency appears first 👑

The frequency_plot function takes in the frequency list we created by using frequency function and displays the words in a histogram 👀📊
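A rough idea of what such a frequency function does can be given with `collections.Counter`. This is a simplified sketch, not the referenced code: the signature below (an iterable of cleaned review strings instead of a dataframe) and the sample reviews are assumptions.

```python
from collections import Counter


def frequency(reviews, n=1, top=10):
    """Return the `top` most common n-grams as (ngram, count) pairs."""
    counts = Counter()
    for review in reviews:
        tokens = review.split()
        # Slide a window of n tokens over the review
        counts.update(
            " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
    # most_common sorts so that the highest frequency appears first
    return counts.most_common(top)


sample = ["best pizza town", "best pizza ever", "good pizza"]
print(frequency(sample, n=1, top=2))  # [('pizza', 3), ('best', 2)]
print(frequency(sample, n=2, top=1))  # [('best pizza', 2)]
```

The returned list of pairs is exactly the kind of input a plotting helper such as frequency_plot needs: words on one axis and counts on the other.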


Let's start by applying the frequency function to all ratings and get the 50 most frequent words used.

We can see that the WordCloud was right: pizza is the most common word with 3597 counts, good is mentioned 1600 times, and food, place, and great follow with around 1200 counts each.

We mentioned above that the word restaurant looked like one of the most common words. However, it is actually number 6, which is still fairly high. This tells us that a WordCloud can be deceiving, perhaps because "restaurant" is a longer word than the others.


👀 Let's visualize the most frequent words using frequency_plot

As in the WordCloud, pizza is by far the most common word.

👫🏽 Let's now take a look at the top 50 most frequent word combinations for all reviews, that is, which TWO words most often appear together.

We use the frequency function, pass in all_ratings, and choose n-grams of size 2.

👆🏼 We see that best pizza is the most common combination of two words for all reviews! That matches the rating distribution: most people who leave reviews for those restaurants are happy with the pizzas 🤩🍕

👀 Let's visualize the most frequent two-word combinations for all ratings

☝🏼 We see that best pizza is by far the most used combination for all reviews. Really good, friendly staff and good pizza come next. The other combinations also look rather positive 🙌🏼

What is also interesting here is that customers value price, as can be seen from the word combinations value money and reasonably priced.


3.2 Reviews: 4 and 5 Stars ⭐️

Let's now analyze only 4 and 5 stars reviews. Hopefully, we can learn something useful on how to run a successful pizza place.

Firstly, we created a column that combines great reviews, that is 4-5 stars reviews, called GreatReviews. This column is a dummy variable that tells us if it is a great review or not (1/0) 👏🏼

Let's now check the number of reviews which are 4 & 5 stars, and 3 stars and below:

👀 It's always good to visualize the results, so let's do that

As observed before, most reviews are 4 and 5 stars ratings. Therefore, we can see that the majority gives great reviews for pizza places in Copenhagen 💯🍕🇩🇰
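The dummy variable and the count described above can be sketched in two lines of pandas. The ratings below are made up, and the column name `rating` is an assumption about the real dataframe.

```python
import pandas as pd

# Hypothetical ratings (not the real data)
pizza_reviews = pd.DataFrame({"rating": [5, 4, 3, 1, 5, 2]})

# 1 for a great review (4 or 5 stars), 0 otherwise
pizza_reviews["GreatReviews"] = (pizza_reviews["rating"] >= 4).astype(int)

# Number of great (1) and not-so-great (0) reviews
counts = pizza_reviews["GreatReviews"].value_counts()
print(counts)
```

Because the column holds 0/1 values, its sum counts the great reviews and its mean gives their share.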

WordCloud ☁️

Now, let's visualize the 4-5 star ratings in a WordCloud.

*The function can be found under [All Ratings - WordCloud](#WordCloud)*

We get very similar results as for all ratings: pizza looks like the most common word, along with place, restaurant, and good.

Let's also count their frequency.

Frequency 📊

*The function can be found under: [All Ratings - Frequency](#Frequency)*

👩🏽‍🍳Start by finding the top 20 single most frequent words for good reviews

Same as for all ratings, restaurant appears more common in the WordCloud than it actually is!

Pizza appears 3026 times, good 1326 times, great 1131, place 1082 and food 1073.

👀 Again, let's visualize them

Pizza appears more than twice as often as the next word, interesting 💡

👫🏽 Let's now take a look at the top 20 most frequent word combinations for good reviews

Wow, the top word combinations are the same as for all reviews. That makes sense since over 83% of all reviews have a rating of 4 or 5 stars 🙌🏼

👀 Let's visualize the most frequent two-word combinations

It looks like in order to get great reviews you need to at least serve great pizza and offer good service with friendly staff - that, however, feels kind of obvious 🤓

Also, italian food ranks high, so maybe the pizzas should be Italian-style rather than New York-style or Chicago-style 🇮🇹 🤌🏼


3.3 Reviews: 2 and 3 Stars ⭐️

WordCloud ☁️

Let's view the 2 and 3 stars ratings in a WordCloud

*The function can be found under [All Ratings - WordCloud](#WordCloud)*

Again, pizza looks like the most common word, along with food and good. However, service also looks like a common word, so it will be interesting to see which words come before or after it.

But firstly, let's count their frequency:

Frequency 📊

We start by finding the top 20 most frequent words for alright reviews using the function frequency.

Alright reviews are defined as ratings from 2 to 3.

*The function can be found under: [All Ratings - Frequency](#Frequency)*

Pizza appears 467 times. Note that it appeared 3026 times in the great reviews, which shows how many more great reviews we have. Good appears 239 times, food 210, and service 164 times.

These are mainly the same words we have been dealing with, but in a different order. Since service is the fourth most common word, it is interesting to see which words it is associated with: is it good service or bad?

👀 Let's first visualize the most frequent words, then we will dive into the word combinations.

The interesting words for reviews with ratings 2 and 3 are service, as mentioned before, but also table and time. These words did not stand out when we analyzed reviews with ratings 4 and 5.

👫🏽 Now, let's take a look at the top 20 most frequent word combinations for alright reviews

If we check service, we can see that reviewers mostly mean good service. However, we can also see some combinations that are not so good: nothing special and long time. As mentioned before, time did not stand out in the reviews with ratings 4 and 5, so not making customers wait a long time is crucial for getting a great review.

👀 Let's plot those words


3.4 Reviews: 1 Star ⭐️

WordCloud ☁️

Now, let's view the "bad" reviews in a WordCloud, will they be similar to what we have seen before?

*The function can be found under [All Ratings - WordCloud](#WordCloud)*

👆🏼 Alright, pizza is still the most common word. Then we have one and food.

It will be interesting to see what words they are involved with 🔍

Frequency 📊

As we have seen, the WordCloud can be a little misleading, so let's use the frequency function to find the top 20 most frequent words for bad reviews

*The function can be found under: [All Ratings - Frequency](#Frequency)*

We have a lot of interesting words there that could either be something bad or good, e.g. food, service, one, waiter and minute. Moreover, the word time appears again frequently for a one-star review!

👀 Let's visualize the most frequent words

As always, pizza is by far the most used word in these reviews 🍕📝

👫🏽 Now, let's find out the top 25 most frequent word combinations for bad reviews

Now, we are getting different results from what we have been seeing! Service now appears in the combination service terrible, and we also have really bad. Moreover, it looks like customers are getting served slowly or not at all! As most of us know, customer service is really important when running a business, and we can see that customer service is bad when restaurants receive reviews with a 1-star rating.

👀 Finally, let's visualize the most frequent two-word combinations for bad reviews

It looks like they all mention slow or bad service. However, as we found in Explore Merged Dataframe, there are not many of these reviews: they make up less than 5% of all reviews. Therefore, it is difficult to draw firm conclusions from them. Nonetheless, they give an insight into what to be aware of when running a pizza restaurant.


4. Location 🗺

Let's look at the location of the pizza restaurants. 🍕 Does the location affect the reviews?

In Milestone1.ipynb we gathered the latitude and longitude of each restaurant and saved them in restaurants.pkl, as mentioned before.

We start by grouping together the restaurants found in pizza_reviews dataframe. Then, we set the storeName as an index so we can merge it with restaurants.pkl to get the latitude and longitude for each restaurant that serves pizza 🍕

👆🏼 Above, we can see all the restaurants that serve pizza along with the sum of their rating and GreatReviews; that is a count of how often they get reviews between 4 and 5 stars as found out in Reviews: 4 - 5 Stars ⭐️


Now we find the restaurant locations by merging the pizza_restaurants and restaurants dataframes on storeName into a new dataframe called rest_loc:

💥 Great! Now we have the rating, latitude, and longitude available in the same dataframe. However, we also have some additional columns that we don't need for this section. We only need GreatReviews, lat, and lon; it is also good to have the storeAddress.

We drop the rest of the columns and sort GreatReviews from highest to lowest, so the top restaurant has the highest count of great ratings. Finally, we set storeName as the index.
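The grouping, merging and sorting steps of this section can be sketched as below. All rows, addresses and coordinates are placeholders, and the left merge is an assumption on our part; only the column names GreatReviews, lat, lon, storeAddress and storeName come from the text.

```python
import pandas as pd

# Hypothetical reviews with the GreatReviews dummy already added
pizza_reviews = pd.DataFrame({
    "storeName": ["Store A", "Store A", "Store B", "Store B", "Store B", "Store C"],
    "rating": [5, 4, 4, 5, 5, 2],
    "GreatReviews": [1, 1, 1, 1, 1, 0],
})

# Hypothetical restaurant table; Store C is missing its coordinates
restaurants = pd.DataFrame({
    "storeName": ["Store A", "Store B", "Store C"],
    "storeAddress": ["Address A", "Address B", "Address C"],
    "lat": [55.67, 55.69, None],
    "lon": [12.55, 12.54, None],
    "CousineType": [["Pizza"], ["Pizza"], ["Pizza"]],
})

# Sum rating and GreatReviews per restaurant
pizza_restaurants = (
    pizza_reviews.groupby("storeName")[["rating", "GreatReviews"]].sum().reset_index()
)

# Attach address and coordinates, keep the needed columns, sort, set the index
rest_loc = (
    pizza_restaurants.merge(restaurants, on="storeName", how="left")
    [["storeName", "storeAddress", "GreatReviews", "lat", "lon"]]
    .sort_values("GreatReviews", ascending=False)
    .set_index("storeName")
)

# Restaurants that still need coordinates
missing_coords = rest_loc[rest_loc["lat"].isna()].index.tolist()
print(rest_loc)
print("Missing coordinates:", missing_coords)
```

Checking for missing coordinates right after the merge makes gaps visible before anything is plotted on a map.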

💥 Looks good! However, we see that one restaurant is missing latitude and longitude, Madenitaly.

So, we manually find the latitude and longitude from this website.

Also, we see that Neighbourhood has two restaurants located in different parts of Copenhagen; however, they get the same latitude and longitude, so we need to fix that as well.

For [0], Istedgade 27, Copenhagen 1650 Denmark:

For [1]: Jaegersborggade 56, Copenhagen 2200 Denmark

We can see that both restaurants contain the coordinates for Istedgade, so we also need to change the coordinates for Neighbourhood at Jaegersborggade.

💥 Great! Now, let's see what it looks like

Awesome, now the two Neighbourhood restaurants have different coordinates and Madenitaly got one 🤩
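A manual fix like this can be done with a boolean mask and `.loc`. The sketch below uses placeholder store names, addresses and coordinates throughout (none of them are the real values), and keeps storeName as a regular column so that the duplicated name can be combined with the address in the mask.

```python
import pandas as pd

# Placeholder data: two branches share a name and (wrongly) the same coordinates,
# and one restaurant has no coordinates at all
rest_loc = pd.DataFrame({
    "storeName": ["Store X", "Store X", "Store Y"],
    "storeAddress": ["Street 1", "Street 2", "Street 3"],
    "lat": [55.0, 55.0, None],
    "lon": [12.0, 12.0, None],
})

# Select the second branch by name AND address, then overwrite its coordinates
second_branch = (rest_loc["storeName"] == "Store X") & (
    rest_loc["storeAddress"] == "Street 2"
)
rest_loc.loc[second_branch, ["lat", "lon"]] = [55.1, 12.1]

# Fill in the restaurant that had no coordinates
rest_loc.loc[rest_loc["storeName"] == "Store Y", ["lat", "lon"]] = [55.2, 12.2]

print(rest_loc)
```

Filtering on the address as well as the name is what keeps the first branch untouched.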


Now we need to remove storeName as an index for the next step

🌏 Looks like the dataset is ready to plot and see where the best restaurants are located.


To see more clearly which restaurants on our list are the best, we separate them by color:

From the map, we can see that the best restaurant, Neighbourhood, is located in both Vesterbro and Nørrebro. Moreover, the map suggests that the location of a restaurant is not relevant for getting a great review.

Lastly, we can see that one restaurant, Tony's Pizzeria, is placed outside of Copenhagen, in Slagelse. However, a simple Google search showed that this restaurant is actually located in Vesterbro, Copenhagen. It most likely got the wrong latitude and longitude in our dataset because there is also a Tony's Pizzeria in Slagelse. Since location doesn't seem to affect the ratings in this analysis, we don't need to do anything to correct this.

As we are still interested in finding out whether location matters and which areas within Copenhagen are trending, we will gather more data from TripAdvisor and look into that further in another notebook 📚


5. Sentiment Analysis ✍🏼

As mentioned in the Introduction, we are consultants for a company that wants to know more about how to run a pizza restaurant that gets great reviews and is therefore most likely going to be popular. The reason many restaurants focus on great reviews is that today most people judge restaurants by their reviews and decide from them where to eat. Furthermore, the company wanted to know what to be aware of. Thus, we are now going to compute feature importance for the words in the reviews. That is, we analyze reviews by their polarity score and then find the most important words associated with positive and negative reviews.

Not only do we want to know what the most common words are, but also what the content of each review tells us. Sentiment Analysis can help us understand if the reviews are good or bad regardless of the review rating. It will be interesting to see if we get the same results as before 🤷🏽‍♂️

In this section, we used code from this and this website. Furthermore, we used Lecture 2, Text Analytics, to assist us in the modeling part. However, we had to adapt the reference code to our dataset 👩🏽‍💻

5.1 Classification 🤖

We have already created a column in the pizza_reviews dataframe called GreatReviews (that will be our sentiment), where reviews with ratings of 4 and 5 stars get 1. Conversely, reviews with ratings of 3 and below get 0.

As we saw before, positive reviews are dominant: they make up over 83% of our dataset. 💥

Secondly, we create a new column named polarity. We import the Python library TextBlob, which provides a simple API for diving into Natural Language Processing (NLP) tasks such as sentiment analysis and classification.

We use TextBlob on our column reviews_cleaned to get the polarity score of each review. The polarity score is a float between -1.0 and 1.0, where negative values indicate a negative review and positive values a positive one 📝

The sign of the polarity score is often used to infer whether the overall sentiment is positive, neutral, or negative. Thus, let's now plot the polarity scores to get a better visualization of them 👀

👆🏼 As observed before, most of them are positive. That is, the reviews have over zero in the polarity score. Moreover, by hovering over our plot above, we can see that most of our reviews have a polarity score between 0.3 and 0.49 🙌🏼


Now we want to analyze the restaurants' polarity score in more detail 🔍

Firstly, we create a list of the restaurants that have the highest sentiment on average, that is, whose reviews are most often categorized as great.

Now we compute each restaurant's average polarity and sort the restaurants in ascending order ⬇️

👀 Let's now plot the five restaurants which on average have the lowest sentiment score. In other words, the restaurants that most often get reviews with ratings of 3 or below

👆🏼 From the plot above, we can see that the restaurant La Vecchia Gastronomia Italiana receives the lowest sentiment score. However, even this score is over 0.6, because positive reviews dominate our dataset by far. Furthermore, this restaurant also has the lowest average polarity score.


Finally, before preparing the data for modeling, let's analyze the polarity score based on the restaurant name 👩🏽‍🍳

💥 We can see that the spread of sentiment polarity is the largest for the restaurant Neighbourhood, where the polarity score ranges from -1 to 1.

Next thing up is to prepare our data for modeling:

🛠 Although our model can accurately predict positive feedback, it would struggle to predict negative feedback because the classes are imbalanced (over 83% of our reviews are positive), so we used random undersampling with RandomUnderSampler.

We must transform the text into numerical feature vectors before using it in machine learning algorithms. For this, we can use TF-IDF or CountVectorizer. With CountVectorizer, we count the number of times a word appears in the text. TF-IDF additionally weights each count by how rare the word is across all reviews, so that words appearing in almost every review carry less weight. We are going to use the TF-IDF approach.


Let's begin by building a function for the TF-IDF. Here we use the RandomUnderSampler, which deletes examples from the majority class. To train the model, we use a test size of 20%, and we set stratify equal to y_under so that the proportion of positive and negative reviews is the same in the training and test sets.

Instead of manually implementing TF-IDF ourselves, we are going to use the class provided by sklearn, TfidfVectorizer. Our feature is the column reviews_cleaned and our target label is the column GreatReviews.

🙌🏼 Great - from the Counter results above we can see that we no longer have an imbalanced dataset!
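The data preparation described above can be sketched roughly as follows. Note the assumptions: the reviews are made up, and the undersampling is done by hand with pandas `groupby(...).sample(...)` as a stand-in for imblearn's RandomUnderSampler, so that the sketch only depends on pandas and scikit-learn.

```python
from collections import Counter

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Made-up cleaned reviews: 8 positive (1) and 4 negative (0)
df = pd.DataFrame({
    "reviews_cleaned": [
        "best pizza town", "great pizza friendly staff", "really good food",
        "lovely place great service", "good pizza nice wine", "amazing crust",
        "friendly staff good value", "great italian food",
        "cold pizza slow service", "waited long time", "really bad food",
        "service terrible never again",
    ],
    "GreatReviews": [1] * 8 + [0] * 4,
})

# Manual random undersampling: sample each class down to the minority size
n_minority = df["GreatReviews"].value_counts().min()
balanced = df.groupby("GreatReviews").sample(n=n_minority, random_state=42)
X_under, y_under = balanced["reviews_cleaned"], balanced["GreatReviews"]
print(Counter(y_under.tolist()))  # both classes now have 4 reviews

# 80/20 split that keeps the class proportions equal in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X_under, y_under, test_size=0.2, stratify=y_under, random_state=42
)

# Fit TF-IDF on the training text only, then reuse it for the test text
vectorizer = TfidfVectorizer(ngram_range=(1, 1))
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)
```

Fitting the vectorizer on the training part only keeps words from the test reviews out of the vocabulary, which gives a more honest evaluation.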


5.1.1 LSA ✏️

Latent Semantic Analysis (LSA) is an information retrieval technique that examines and describes patterns in unstructured collections of text, as well as their relationships. It is an unsupervised method for uncovering related terms, such as synonyms, across a large number of documents.

We will use the transformer TruncatedSVD, which performs linear dimensionality reduction by means of truncated singular value decomposition (SVD). Unlike PCA, this estimator does not center the data before computing the singular value decomposition, which means it can work efficiently with sparse matrices. Let's define a function to plot LSA and apply it to our training set. We will visualize it in 2-D.

👆🏼 These embeddings are pretty cleanly separated. Therefore, let's try applying it to the TfidfVectorizer

From both of our LSA plots, we can see that the embeddings are more separated on our plot when using ngram_range = (1,1).
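For reference, the core of an LSA projection looks roughly like this. The documents are made up and the plotting part is left out; this is only a sketch of the TF-IDF plus TruncatedSVD step, not the plotting function used in the notebook.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up cleaned reviews (not the real data)
docs = [
    "best pizza town", "great pizza friendly staff", "really good food",
    "cold pizza slow service", "waited long time", "service terrible",
]

# TF-IDF returns a sparse matrix; TruncatedSVD accepts it without densifying
tfidf = TfidfVectorizer(ngram_range=(1, 1)).fit_transform(docs)

# Project every review onto 2 components for a 2-D scatter plot
lsa = TruncatedSVD(n_components=2, random_state=42)
embedding = lsa.fit_transform(tfidf)

print(embedding.shape)  # (6, 2)
```

Each row of `embedding` is one review; coloring the points by GreatReviews gives the kind of plot discussed above. Changing `ngram_range` on the vectorizer is all it takes to compare settings.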


5.1.2 Modelling 💁🏼‍♀️

Now the fun starts! We begin by defining a function for standard performance metrics to be able to evaluate the quality of the output of a classifier. Moreover, we want to plot the confusion matrix 📊

Here we begin by creating a list to gather the results from our classifiers. We are going to use:

Moreover, we use the cross_val_score to get CV predictions and StratifiedKFold to provide train/test indices to split data in train/test sets 🚂

🤖 Let's now import our classifiers and apply them to our training and test set:

🙃 The results from our classifiers are a bit messy, so let's turn our performance list into a DataFrame to get a clearer overview.
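The evaluation loop can be sketched as below. The two classifiers shown are only an illustrative lineup (LinearSVC is the one named in this notebook; MultinomialNB is our assumption), and the texts and labels are made up.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up balanced dataset: 6 positive and 6 negative reviews
texts = [
    "best pizza town", "great pizza friendly staff", "really good food",
    "lovely place great service", "good pizza nice wine", "great italian food",
    "cold pizza slow service", "waited long time", "really bad food",
    "service terrible never again", "rude waiter bad pizza", "slow terrible place",
]
labels = [1] * 6 + [0] * 6

# Illustrative lineup of classifiers
classifiers = {"LinearSVC": LinearSVC(), "MultinomialNB": MultinomialNB()}

# Stratified folds keep the class balance in every train/test split
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

performance = []
for name, clf in classifiers.items():
    # Vectorizer inside the pipeline, so it is refit on each training fold
    model = make_pipeline(TfidfVectorizer(), clf)
    scores = cross_val_score(model, texts, labels, cv=cv, scoring="accuracy")
    performance.append(
        {"model": name, "mean_accuracy": scores.mean(), "std": scores.std()}
    )

# A DataFrame is easier to read than the raw list of results
results = pd.DataFrame(performance).sort_values("mean_accuracy", ascending=False)
print(results)
```

Sorting the DataFrame by the mean score puts the best classifier in the first row.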

🤩 Here we can see that we are getting the best result with the LinearSVC. Therefore, let's train and test on that classifier and insert the results from it into our confusion matrix function.

👆🏼 From the confusion matrix above, we can see that our classifier creates more false negatives than false positives.


5.1.3 Feature Importance 👌🏼

Feature importance scores are critical in predictive modelling projects because they provide insight into the data, insight into the model, and the foundation for dimensionality reduction and feature selection, all of which can increase the efficiency and effectiveness of a predictive model on the problem.

Here we are going to use Feature Importance Score to get a better insight into our data.

Let's begin by implementing GridSearchCV and fitting it to our best model (LinearSVC) to find the best parameters.

👆🏼 Above, we can see that the best parameter value is 0.1. Moreover, we are going to use "l2" as a penalty when fitting our data to our model. We also reuse our standard performance metrics and the confusion matrix.

🥳 Now we can see that our model is predicting a more balanced outcome than before!
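A grid search of this kind can be sketched as follows. We assume here that the tuned parameter is LinearSVC's regularization strength `C`; the grid values, texts and labels are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Made-up balanced dataset: 6 positive and 6 negative reviews
texts = [
    "best pizza town", "great pizza friendly staff", "really good food",
    "lovely place great service", "good pizza nice wine", "great italian food",
    "cold pizza slow service", "waited long time", "really bad food",
    "service terrible never again", "rude waiter bad pizza", "slow terrible place",
]
labels = [1] * 6 + [0] * 6

X = TfidfVectorizer().fit_transform(texts)

# Hypothetical grid over the regularization strength C, with an l2 penalty
param_grid = {"C": [0.01, 0.1, 1, 10]}
search = GridSearchCV(LinearSVC(penalty="l2"), param_grid, cv=3, scoring="accuracy")
search.fit(X, labels)

best_C = search.best_params_["C"]
print("Best C:", best_C, "with CV accuracy", round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` holds a model refit on all the training data with the winning value, ready for the confusion matrix.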


Let's now import the eli5 Python package which helps to debug classifiers and explain their predictions.

Firstly, we start by building a vectorizer to get the feature names. That is, to be able to get readable results when using the eli5 package.

Lastly, we use the RandomUnderSampler again, and pass our classifier and vectorizer to eli5.show_weights to get readable output for the Feature Importance.

😲 What is interesting here is that the word cold-fermented receives high importance for the positive reviews. A simple Google search told us that cold fermentation gives pizza dough a noticeably better flavor and structure. Moreover, beer and calzones are important for positive reviews.

The negative reviews are more about the words distasteful and comfort (most likely the place or the seats were not comfortable). Furthermore, the negative reviews contain two numbers, 500 and 450, which we assume reflect customers complaining about the price 💸


6. Conclusion 🍕

Analyzing the Reviews by Ratings & Location

As mentioned in the introduction, we are consultants for a company that wants to open a pizza restaurant in Copenhagen. First and foremost, the company wants the restaurant to receive great reviews. When analyzing all the reviews in our dataset, we found that customers have been giving restaurants that offer pizza higher ratings in recent years. Moreover, when exploring the word frequency for the overall ratings, we found that pizza was the most mentioned word, followed by good, food, and place. It is also important to know which words stand before or after the most frequent words. When we evaluated two words together for the overall ratings, best pizza, really good and friendly staff were used the most. This is not surprising because over 83% of the reviews in our dataset were great reviews.

To get more insight into the reviews, we decided to split them into three groups. The first group was great reviews, that is, reviews with ratings of 4 and 5 stars. The most frequent words there were pizza, good and great. These are similar to the words we observed for the overall ratings, which makes sense as our dataset consisted mostly of great reviews. Furthermore, when analyzing the most frequent two-word combinations for great reviews, we got almost the same combinations as for the overall ratings. What was interesting here, however, was that reviewers mentioned reasonably priced and value money.

The second group consisted of reviews with 2 and 3 stars. We were still getting the same frequent words, but now in a different order. Therefore, it was interesting to see which words stood next to them most often. Here we still had word combinations like good service. However, word combinations like nothing special and long time began to appear for these reviews.

Finally, the last group contained reviews with a rating of 1. For these reviews, words connected to time appeared frequently. When analyzing the word combinations for 1-star reviews, we got a completely different result. Service now appeared in the combination service terrible. Moreover, it looked like customers were getting served slowly or not at all. One more interesting word combination for these reviews was like popadom. Popadom is an Indian dish, and reviewers were most likely comparing their pizza with this dish.

To conclude, pizza, place, and restaurant appeared most frequently across all our reviews. In hindsight, it would have been good practice to add these words to our stopwords to gain better insight into the reviews. However, what we learned from this analysis is that a pizza restaurant that wants great reviews needs, of course, to serve great pizza, offer great service with friendly staff, and sell the pizza at a reasonable price. Moreover, the data we had for analyzing the location was insufficient, so further analysis will be done in a separate notebook with more data.

Frequency of single words

| # | All Ratings | 4-5 Stars | 2-3 Stars | 1 Star |
|---|---|---|---|---|
| 1 | Pizza | Pizza | Pizza | Pizza |
| 2 | Good | Good | Good | Food |
| 3 | Food | Great | Food | Service |
| 4 | Place | Place | Service | Table |
| 5 | Great | Food | Restaurant | One |

Frequency of two-word combinations

| # | All Ratings | 4-5 Stars | 2-3 Stars | 1 Star |
|---|---|---|---|---|
| 1 | Best Pizza | Best Pizza | Good Pizza | Minute Late |
| 2 | Really Good | Really Good | Food Good | 10 Minute |
| 3 | Friendly Staff | Friendly Staff | Italian Restaurant | Really Bad |
| 4 | Good Pizza | Great Pizza | Pizza Good | 20 Minute |
| 5 | Italian Food | Good Pizza | Really Good | Booked Table |

Sentiment Analysis

In this section, we performed a sentiment analysis on our reviews. We wanted to know which words were most important for great reviews and what the company should be aware of when running a pizza restaurant. We began by checking which restaurants received the fewest great reviews on average and analyzed their average polarity score. The same restaurant, La Vecchia Gastronomia Italiana, had both the lowest sentiment score (receiving the fewest great reviews on average) and the lowest average polarity score.

To find the feature importance, we had to build a machine learning model and use the TF-IDF approach to convert the text into numerical feature vectors. We tried multiple models and got the best results with LinearSVC, whose predictions were decent enough for us to derive the feature importance matrix.

The word august received the highest importance in our positive reviews. It was difficult to find a good explanation for why this word would be important for pizza restaurants. However, we saw some interesting words such as cold-fermented, beer and calzones. Therefore, a pizza restaurant that wants great reviews would do well to have a great cold-fermented recipe for its pizza dough and to offer good beers and calzones. 🍻

The most interesting important features for the negative reviews were the numbers 450 and 500, which are most likely connected to the price of the food/pizza. 💰 We saw a similar signal when analyzing the reviews with ratings of 4 and 5 stars: customers are price sensitive, want to buy the pizza at a reasonable price, and want value for what they pay. Therefore, the most likely explanation for these numbers is that customers feel they are paying too much for their pizza, and thus the restaurant receives negative reviews. Moreover, the words distasteful and comfort were also connected to the negative reviews. The word comfort most likely indicates that the restaurant or its seats were not comfortable 🪑

Conducting two different types of analysis gave us two different results for the "Best" pizza place in Copenhagen, as can be seen in the table below.

Doing a Text Analysis led us to Neighbourhood being the best restaurant; however, in the Sentiment Analysis, Neighbourhood was only the 13th best restaurant. The reverse holds for the top restaurant from the Sentiment Analysis: Tony's Pizzeria ranks highest there, but only 15th in the Text Analysis.

Pizzeria MaMeMi & Wine Bar ranks 2nd in Text Analysis and 3rd in Sentiment Analysis. So, combining the two analyses, Pizzeria MaMeMi & Wine Bar is the top pizza restaurant in Copenhagen 🤩🤌🏼🍕


| # | Text Analysis | Sentiment Analysis |
|---|---|---|
| 1 | Neighbourhood | Tony's Pizzeria |
| 2 | Pizzeria MaMeMi & Wine Bar | Madbaren Marmorkirken |
| 3 | Trattoria Fiat | Pizzeria MaMeMi & Wine Bar |
| 4 | Baest | Madenitaly |
| 5 | La Vecchia Signora | Cafe Marzano |
| 6 | Restaurant Tio Marios | Perbacco |
| 7 | Madbaren Marmorkirken | Restaurant Tio Marios |
| 8 | Gorm's Magstræde | Trattoria Fiat |
| 9 | Da Salvo | Made in Italy |
| 10 | La Vecchia Gastronomia Italiana | Leifs Pizzeria |
| 11 | Madenitaly | Da Salvo |
| 12 | Perbacco | Nyhavn 14 |
| 13 | Made in Italy | Neighbourhood |
| 14 | Nyhavn 14 | Gorm's Magstræde |
| 15 | Tony's Pizzeria | Baest |
| 16 | Quattro Fontane Italiensk Restaurant | La Vecchia Signora |
| 17 | Leifs Pizzeria | Quattro Fontane Italiensk Restaurant |
| 18 | Cafe Marzano | La Vecchia Gastronomia Italiana |

As a result, it is essential to both analyze the most frequent words appearing for different ratings of reviews and perform a sentiment analysis on the reviews. By this, we obtain a better insight into our dataset. 🤓